Conversation

Alex-PLACET (Member)

No description provided.

@Alex-PLACET Alex-PLACET requested a review from Copilot October 3, 2025 14:00

@Copilot Copilot AI left a comment

Pull Request Overview

Introduces a new streaming-based serialization architecture replacing vector-returning functions with pluggable output_stream abstractions (memory, file, chunked) and adds higher-level serializer / chunk_serializer utilities. Refactors flatbuffer construction logic into flatbuffer_utils, moves type mapping out of utils, and adds size‑estimation helpers for preallocation.

  • Added output_stream interface with memory, file, and chunked implementations plus new (chunk_)serializer classes
  • Refactored flatbuffer and body serialization into modular helpers; functions now write directly to streams
  • Added size estimation utilities (calculate_*_message_size / calculate_total_serialized_size) and extensive new tests
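
As a rough usage sketch of the new API (hedged: the sparrow_ipc namespace, the memory_output_stream template parameter, and its constructor are assumptions inferred from the file list below, not verified signatures):

#include <cstdint>
#include <vector>

#include <sparrow_ipc/memory_output_stream.hpp>
#include <sparrow_ipc/serializer.hpp>

void write_to_buffer(const std::vector<sparrow::record_batch>& batches)
{
    std::vector<std::uint8_t> buffer;
    // assumed: the in-memory stream wraps a caller-owned byte container
    sparrow_ipc::memory_output_stream<std::vector<std::uint8_t>> stream(buffer);
    // the multi-batch constructor writes the schema message, then every batch
    sparrow_ipc::serializer ser(batches, stream);
}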

Reviewed Changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 20 comments.

File Description
src/utils.cpp Removes flatbuffer type logic; retains parsing & alignment helpers
src/flatbuffer_utils.cpp / include/sparrow_ipc/flatbuffer_utils.hpp New central flatbuffer construction and buffer/node utilities
src/serialize_utils.cpp / include/sparrow_ipc/serialize_utils.hpp Stream-oriented serialization helpers and size calculators
src/serialize.cpp / include/sparrow_ipc/serialize.hpp Stream-based schema & record batch serialization entry points
src/serializer.cpp / include/sparrow_ipc/serializer.hpp Adds serializer class for continuous IPC stream writing
src/chunk_memory_serializer.cpp / include/sparrow_ipc/chunk_memory_serializer.hpp Adds chunked (per-message vector) serialization
include/sparrow_ipc/output_stream.hpp Defines abstract output_stream interface
include/sparrow_ipc/memory_output_stream.hpp In-memory implementation
include/sparrow_ipc/file_output_stream.hpp / src/file_output_stream.cpp File-backed implementation
include/sparrow_ipc/chunk_memory_output_stream.hpp Chunked multi-vector output stream (name misspelled)
tests/* Updated & expanded tests for new streaming APIs and utilities
CMakeLists.txt / tests/CMakeLists.txt Adds new sources & headers to build system
include/sparrow_ipc/utils.hpp API changes: align_to_8 signature and new parse_format exposure
Other headers Minor adjustments


Comment on lines 38 to 47
template <std::ranges::input_range R>
requires std::same_as<std::ranges::range_value_t<R>, sparrow::record_batch>
serializer(const R& record_batches, output_stream& stream)
: m_pstream(&stream)
, m_dtypes(get_column_dtypes(record_batches[0]))
{
if (record_batches.empty())
{
throw std::invalid_argument("Record batches collection is empty");
}

Copilot AI Oct 3, 2025


The multi-batch constructor writes the first record batch twice: once via append(record_batches) (which iterates all elements) after already serializing record_batches[0]. This duplicates the first batch in the output. Either start the append loop from the second element or exclude record_batches[0] from append.


};
m_pstream->reserve(reserve_function);
serialize_schema_message(record_batches[0], *m_pstream);
append(record_batches);

Copilot AI Oct 3, 2025


The multi-batch constructor writes the first record batch twice: once via append(record_batches) (which iterates all elements) after already serializing record_batches[0]. This duplicates the first batch in the output. Either start the append loop from the second element or exclude record_batches[0] from append.

Suggested change
append(record_batches);
if (std::ranges::distance(record_batches) > 1) {
append(std::ranges::subrange(std::next(record_batches.begin()), record_batches.end()));
}


Alex-PLACET (Member, Author)


No, we first serialize the schema, then we serialize the arrays.

Comment on lines 21 to 33
template <std::ranges::input_range R>
requires std::same_as<std::ranges::range_value_t<R>, sparrow::record_batch>
chunk_serializer(
const R& record_batches,
chuncked_memory_output_stream<std::vector<std::vector<uint8_t>>>& stream
)
: m_pstream(&stream)
{
if (record_batches.empty())
{
throw std::invalid_argument("Record batches collection is empty");
}
m_dtypes = get_column_dtypes(record_batches[0]);

Copilot AI Oct 3, 2025


Same duplication issue: the first record batch is serialized twice (once explicitly and once through append(record_batches)). Adjust append to skip the first element or iterate from the second batch here.



codecov-commenter commented Oct 3, 2025


Codecov Report

❌ Patch coverage is 92.89941% with 36 lines in your changes missing coverage. Please review.
⚠️ No coverage report was uploaded for BASE (main@25a9ef6), so the comparison below has no baseline.

Files with missing lines Patch % Lines
src/flatbuffer_utils.cpp 91.74% 18 Missing ⚠️
include/sparrow_ipc/any_output_stream.hpp 80.70% 11 Missing ⚠️
include/sparrow_ipc/memory_output_stream.hpp 85.18% 4 Missing ⚠️
src/chunk_memory_serializer.cpp 92.30% 1 Missing ⚠️
src/serialize_utils.cpp 95.23% 1 Missing ⚠️
src/serializer.cpp 95.23% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main      #29   +/-   ##
=======================================
  Coverage        ?   79.82%           
=======================================
  Files           ?       31           
  Lines           ?     1294           
  Branches        ?        0           
=======================================
  Hits            ?     1033           
  Misses          ?      261           
  Partials        ?        0           
Flag Coverage Δ
unittests 79.82% <92.89%> (?)



@Alex-PLACET Alex-PLACET marked this pull request as ready for review October 7, 2025 11:05
@Alex-PLACET Alex-PLACET force-pushed the add_output_stream_and_serializers branch from 14952d0 to 75c9827 Compare October 7, 2025 12:41
CHECK_EQ(utils::align_to_8(15), 16);
CHECK_EQ(utils::align_to_8(16), 16);
}

Alex-PLACET (Member, Author)


Moved to another test file

Alex-PLACET (Member, Author)


Moved from another test file


@JohanMabille JohanMabille left a comment


The implementation is neat; I just have a few remarks regarding the architecture and the API.

API

We want to be able to write something like:

std::ofstream out("my_file");
serializer ser(out);
ser << my_schema << my_array << my_list_of_batches << end_of_record;

Serializer

This means that:

  • The serializer must accept any kind of stream, including the standard ones (see the next section for more detail)
  • The serializer should store an internal state (e.g., waiting for a schema before accepting arrays or record batches) so that its constructor does not need to accept a record_batch

The serializer's method names should mirror the stream method names (i.e. append is really write); it should also contain all the sparrow-specific logic, such as adding padding. A minimal sketch of this state-driven design follows below.
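
A minimal sketch of that idea, assuming nothing beyond what is described above (all names are illustrative, not the PR's code; the actual schema and body serialization is elided into comments):

enum class serializer_state { waiting_for_schema, writing_batches, closed };

template <class Stream> // Stream would model the output_stream concept shown below
class serializer
{
public:
    explicit serializer(Stream& stream)
        : m_stream(&stream)
    {
    }

    template <class RecordBatch>
    serializer& write(const RecordBatch& batch)
    {
        if (m_state == serializer_state::waiting_for_schema)
        {
            // serialize the schema message derived from `batch` here
            m_state = serializer_state::writing_batches;
        }
        // serialize the record batch body here, applying the sparrow-specific
        // padding in this layer rather than in the stream
        return *this;
    }

private:
    Stream* m_stream;
    serializer_state m_state = serializer_state::waiting_for_schema;
};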

Streams

The hierarchy of streams is actually a hierarchy of stream adaptors. I think this hierarchy can be removed and replaced with a couple of concepts that help the implementation of the serializer:

template <class T>
concept output_stream = requires(T& t, const char* str)
{
    t.write(str, size_t(0));
    t.flush();
};

template <class T>
concept reservable_output_stream = output_stream<T> && requires(T& t)
{
    t.reserve(size_t(0));
};

If you need a layer between the streams and the serializer to adapt the signatures (because streams accept const char* while you serialize into std::span<uint8_t>), a stream_adapter class can be used to avoid repeating the cast everywhere. This layer should not contain additional logic like add_padding, which should live in the serializer. A possible shape of such an adapter is sketched below.
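
For instance, such an adapter could be as thin as the following sketch (illustrative names; it only converts std::span<const uint8_t> into the (const char*, size_t) signature expected by the concept above):

#include <cstdint>
#include <span>

template <class Stream> // Stream is expected to model the output_stream concept above
class stream_adapter
{
public:
    explicit stream_adapter(Stream& stream)
        : m_stream(&stream)
    {
    }

    void write(std::span<const std::uint8_t> bytes)
    {
        // pure signature adaptation: no padding or other serialization logic here
        m_stream->write(reinterpret_cast<const char*>(bytes.data()), bytes.size());
    }

    void flush()
    {
        m_stream->flush();
    }

private:
    Stream* m_stream;
};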

Markers

The markers (indicating the end of a record_batch list, for instance) should follow the same pattern as std::endl: a function accepting and returning a serializer, as sketched below.
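
Concretely, that could look like the following sketch (end_of_record and the operator<< overload below are illustrations of the suggestion, not existing sparrow_ipc API):

struct serializer
{
    // accept a manipulator, exactly as std::ostream does for std::endl
    serializer& operator<<(serializer& (*manipulator)(serializer&))
    {
        return manipulator(*this);
    }
};

// Marks the end of a record_batch list; the real implementation would write
// the end-of-stream marker to the underlying stream.
inline serializer& end_of_record(serializer& ser)
{
    return ser;
}

// Usage: ser << end_of_record;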

* operations. Implementations can target different destinations such as files,
* memory buffers, network streams, etc.
*/
class SPARROW_IPC_API output_stream
JohanMabille (Member)


I guess this class can be removed now that there is type erasure for streams?

Alex-PLACET (Member, Author)


You are right, I just removed it
